Method and system for orchestrating multi-party services using semi-cooperative Nash equilibrium based on artificial intelligence, neural network models, reinforcement learning and finite-state automata

ABSTRACT

Distributing resources in a predetermined geographical area, including: retrieving a set of metrics indicative of factors of interest related to operation of the resources for at least two parties, each having a plurality of resources; retrieving optimization policies indicative of preferred metric values for each party; retrieving at least one model including strategies for distributing resources in the predetermined area, the at least one model based on learning from a set of scenarios for distributing resources; retrieving context data from real-time systems indicative of at least a present traffic situation; establishing a Nash equilibrium between the metrics in the optimization policies of the at least two parties taking into account the at least one model and the context data; and distributing the resources in the geographical area according to the outcome of the established Nash equilibrium.

CROSS-REFERENCE TO RELATED APPLICATION

The present patent application/patent claims the benefit of priority of U.S. Provisional Patent Application No. 62/668,904, filed on May 9, 2018, and entitled “METHOD AND SYSTEM FOR ORCHESTRATING MULTI-PARTY SERVICES USING SEMI-COOPERATIVE NASH EQUILIBRIUM BASED ON ARTIFICIAL INTELLIGENCE, NEURAL NETWORK MODELS, REINFORCEMENT LEARNING AND FINITE-STATE AUTOMATA,” the contents of which are incorporated in full by reference herein.

FIELD OF THE INVENTION

The present invention relates to a method and system for distributing resources in a predetermined geographical area.

BACKGROUND OF THE INVENTION

In recent years, human-assisted self-driving vehicles and fully autonomous vehicles have received more attention. An autonomous vehicle may be able to navigate through a city by itself, without any active intervention by a human operator.

An autonomous vehicle requires relatively complicated programming and machine learning algorithms to be able to make fast and accurate decisions in real time. In human-assisted self-driving vehicles, there is still a human operator to control the vehicle in some critical situations.

For a group of autonomous vehicles to drive in an area, such as a city, and avoid collisions, it may be expected that they share information, such as location, speed, and travelling direction, with each other. The vehicles may also be equipped with proximity sensors and cameras for identifying obstacles and objects near the vehicle. Accordingly, when travelling through the city, the vehicles may identify and avoid nearby objects, as well as plan routes using knowledge about other vehicles in their vicinity.

With the introduction of autonomous vehicles and human-assisted self-driving vehicles, transportation for people and delivery services may be provided by fleets of self-driving vehicles. The driving control of autonomous vehicles in specific traffic situations is becoming well explored; however, on a large scale, such as an entire city, it is of interest how to distribute the vehicles, or other service units, across the city in the most efficient way.

Accordingly, there is a need for ways of distributing service units across an area so as to meet the service demand in the city.

SUMMARY OF THE INVENTION

In view of the above, it is an object of the present invention to provide an improved method for distributing resources in a predetermined geographical area.

According to a first aspect of the invention, there is provided a method for distributing resources in a predetermined geographical area, the method including: retrieving a set of metrics, the metrics indicative of factors of interest related to operation of the resources for at least two parties, each party having a plurality of resources; retrieving optimization policies indicative of preferred metric values for each of the at least two parties; retrieving at least one model including strategies for distributing resources in the predetermined area, the at least one model based on learning from a set of scenarios for distributing resources; retrieving context data from real-time systems, the context data indicative of at least a present traffic situation; establishing a Nash equilibrium between the metrics in the optimization policies of the at least two parties taking into account the at least one model and the context data; and distributing the resources in the geographical area according to the outcome of the established Nash equilibrium.

The present invention is based on the realization that Nash equilibrium may be applied to optimization policy metrics while at the same time taking into account models including strategies for the distribution of resources, in order to find an at-the-moment advantageous distribution of the resources in the predetermined area. Further, it is realized that Nash equilibrium may be applied to a continuous dynamic process, e.g. to resources being mobile in the predetermined area without discrete states, which means that the conditions for the resources may change, and not necessarily in a deterministic way as is the case with discrete processes. For example, an optimization policy may suddenly change, whereby the Nash equilibrium will also change; the inventive concept takes that into account by using the Nash equilibrium to determine the distribution of resources.

A Nash equilibrium is a state in which no party can improve its position by changing its optimization policy while the other parties maintain their optimization policies. Thus, the Nash equilibrium may be a steady state with respect to the metrics that a party cares about. For example, for a city, the Nash equilibrium may be maximum revenue, roads that are full but not congested, and customers who have enjoyed the mobility service (as reflected by ratings). The property of a Nash equilibrium is thus that the participants have no incentive to change their parameters (e.g. price, street conditions, etc.).

The Nash equilibrium in accordance with the inventive concept may be a semi-static Nash equilibrium, which may still allow a small deviation in the parties' model strategies without compromising the Nash equilibrium state. Thus, a Nash equilibrium in accordance with the inventive concept may be considered to be in a state of equilibrium even if the Nash equilibrium is not fully satisfied, but is within a tolerance (ϵ).
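
As a minimal sketch of how such a tolerance might be checked (not part of the claimed system), the function below tests whether a strategy profile is an ϵ-Nash equilibrium over finite candidate strategy sets; the payoff callable and strategy names are illustrative assumptions:

    from typing import Callable, Dict, List

    Profile = Dict[str, str]  # party -> chosen strategy

    def is_epsilon_nash(payoff: Callable[[str, Profile], float],
                        strategies: Dict[str, List[str]],
                        profile: Profile,
                        epsilon: float) -> bool:
        # True if no party can gain more than epsilon by unilaterally
        # deviating from its strategy in `profile`.
        for party, options in strategies.items():
            current = payoff(party, profile)
            for alternative in options:
                trial = dict(profile, **{party: alternative})
                if payoff(party, trial) > current + epsilon:
                    return False  # profitable unilateral deviation found
        return True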

The metrics are factors that are important for each of the parties. Example metrics are wait time, cost, revenue, traffic flow, delivery time, experience, explored area, brand value, business risk, retention rate, market share, popularity, gross revenue to city, etc. This list of metrics is non-exhaustive.

The optimization policies are the preferred values for each of a set of metrics that are of importance for the specific party. Further, the metrics of the optimization policies may be weighted, for instance 0.75*Revenue + 0.25*Retention.
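
Such a weighted policy can be sketched as a simple scoring function; the metric names, values, and weights below are illustrative, merely mirroring the example above:

    def policy_score(metrics: dict, weights: dict) -> float:
        # Score an outcome as a weighted sum of its metric values.
        return sum(weights[name] * metrics[name] for name in weights)

    score = policy_score(
        metrics={"revenue": 1200.0, "retention": 0.85},  # illustrative
        weights={"revenue": 0.75, "retention": 0.25},
    )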

A model provides a set of algorithms that describe possible outcomes based on input data. The model consists of constants, parameters, probabilities, action trees, graphs with edges and nodes, and so forth. The trees and graphs would have their own multiple attribute sets at every node and edge. A model may take, e.g., context data and optimization policies as input and provide a predicted outcome based on previous training and the input. In establishing Nash equilibrium, the different possible outcomes are weighed against each other, and the optimization policies may have to be changed in order to establish Nash equilibrium. From establishing Nash equilibrium, the resources may be distributed in the geographical area in an at-the-moment most advantageous distribution for the parties.

A strategy may be, e.g., a plan, a policy, or a course of action. For example, every distribution state has a set of next actions and a set of rewards associated with each action. Such rewards may be local or cumulative. One way of expressing a policy is with a probability distribution over every action. For example, a policy might be to move the resources randomly if the utilization is less than 25%. Another policy might be to triple the price if a higher demand is determined (e.g. if the optimization policy includes maximizing revenue); a sketch of such rule-based policies follows below. Distribution of resources may refer to several types of resources: for instance, moving autonomous vehicles to more advantageous locations in the predetermined area with respect to mobility demand or charging (in the case of electric vehicles, for utilizing the power grid in a safe and efficient way), exploring the city in order to learn more for the specific autonomous vehicle, or orchestrating vehicle-related functions, such as cleaning and maintenance services.
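
The example policies above can be sketched as a mapping from a distribution state to action probabilities; the thresholds and action names are illustrative assumptions, not the claimed strategies:

    import random

    def policy(state: dict) -> dict:
        # Map a distribution state to action probabilities.
        if state["utilization"] < 0.25:
            return {"move_random": 1.0}                # explore when idle
        if state["demand"] > state["supply"]:
            return {"triple_price": 0.8, "stay": 0.2}  # revenue-maximizing rule
        return {"stay": 0.6, "move_toward_demand": 0.4}

    def sample_action(probs: dict) -> str:
        actions, weights = zip(*probs.items())
        return random.choices(actions, weights=weights)[0]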

A party may, for example, be a mobility service provider, a package delivery operator, a charging station operator, or a city. Thus, the established Nash equilibrium could be between a first party being a mobility service provider and a second party being the city itself. Accordingly, the city may have in its optimization policy to avoid high-volume traffic on certain routes in order to avoid traffic congestion, and the mobility service provider may want to maximize profit. In Nash equilibrium, the mobility service provider may not send all its resources to the routes indicated in the city's optimization policy, because that may cause traffic congestion, which means the resources would be stuck in traffic and not generate profit.

Another possible scenario is that a charging station provider for electric vehicles may desire that only a maximum number of electric vehicles be charged at the same time. In Nash equilibrium, a mobility service provider managing electric vehicles may thus not send all its vehicles for charging at the same time.

A further scenario is that a first party is a mobility service provider and a second party is a customer requesting a mobility service. There may further be several parties being mobility service providers that compete for the customer.

Some embodiments include receiving a request for a resource, the request including a set of preferred metrics, and establishing the Nash equilibrium based further on the request. The request may come from a customer with its own preferences. For example, a student travelling on a Tuesday morning may have a set of preferred metrics that differ from the preferred metrics of a high-end restaurant visitor on a Saturday night. The Nash equilibrium is then established between the preferred metrics and the metrics in the optimization policies of the parties.

Further, in some embodiments, an offer may be provided based on the request and the outcome of the established Nash equilibrium, a response to the offer may be received, and the resources may be distributed further based on the response. Accordingly, the outcome of establishing Nash equilibrium (or at least near Nash equilibrium) may result in an offer made to the source of the request. The source of the request may choose to accept or decline the offer.

The models may be trained using reinforcement learning algorithms. Accordingly, by using reinforcement learning, the model may be trained based on optimization policies, using the Nash equilibrium as a reward function for the reinforcement learning. The reinforcement learning algorithm may be a deep reinforcement learning algorithm.

In some possible implementations, the deep reinforcement learning algorithm may be a multi-layer convolutional neural network including optional recurrent or recursive layers.

According to some possible embodiments, the method may include calculating adaptation factors for a further geographical area not included in the predetermined area based on at least area size and population density at places of interest, scaling properties of the model for the predetermined area to the further area to form an adapted model, and using the adapted model for distributing resources in the further area. Accordingly, if the models have been developed for a predetermined area, for example the city of Los Angeles, it may be possible to scale those models to a smaller city, such as Las Vegas, and then use the adapted models for implementing the inventive concept in a city not previously modelled. Accordingly, the method may advantageously be used for areas other than the predetermined area. The adaptation factors may be based on city size, city area, demographics, number of vehicles, etc. The demand model may be a modelled distribution of demand across the predetermined area.
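
As a rough sketch of such scaling, under the simplifying assumption of a linear scalar factor (as noted later with reference to FIG. 3, a real adaptation would likely be non-linear and learned), an adaptation factor from area size and population density at places of interest might be computed as follows; all city figures are illustrative:

    def adaptation_factor(source: dict, target: dict) -> float:
        # Scalar factor between two areas from area size and
        # point-of-interest population density (a simplification).
        size_ratio = target["area_km2"] / source["area_km2"]
        density_ratio = target["poi_density"] / source["poi_density"]
        return size_ratio * density_ratio

    los_angeles = {"area_km2": 1300.0, "poi_density": 3200.0}  # illustrative
    las_vegas = {"area_km2": 350.0, "poi_density": 1700.0}     # illustrative
    base_hourly_demand = 1000.0  # from the source-area demand model
    scaled_demand = base_hourly_demand * adaptation_factor(los_angeles,
                                                           las_vegas)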

According to further embodiments, the method may include constructing a demand model based on a demand meta-model, context data, and place, the demand model adapted to predict transportation need; constructing an acceptance model based on the demand model, party mobility preferences, and transportation options, the acceptance model adapted to predict transportation preferences; generating an intent model based on the arrival point and venue; and establishing the Nash equilibrium based further on the demand model, the acceptance model, and the intent model.

The context may be, e.g., time of day, a bridge opening, traffic intensity, or whether there is an event in the area (e.g. a concert, a game, etc.). A demand model may provide a distribution of resource demand throughout the city, for example that there may be high or low demand in a certain sub-area. The acceptance model may indicate what type of transportation is acceptable for a resource enquirer. With the demand, acceptance, and choice models, it is possible to distribute the resources more accurately with respect to demand in the predetermined area.

The acceptance model may further be based directly on the context data, instead of indirectly via the demand model.

The demand, acceptance, and choice models may be generated based on reinforcement learning. Accordingly, the method may include receiving real-time context data, and updating the demand model, the acceptance model, and the choice model based on the real-time context data to improve the models further.

In addition, to improve the accuracy of demand prediction, the method may include generating the optimization policies based on the demand model, the acceptance model, the choice model, and reinforcement learning algorithms.

The resources may be mobility units, such as goods, boats, semi-self-driving vehicles, autonomous vehicles, etc.

The reinforcement learning mechanism may, for example, be at least one of a Partially Observable Markov Decision Process, Policy Gradient, Deep Q-Learning, an Actor-Critic Method, Monte Carlo tree search, and the Counterfactual Regret Minimization technique.

According to embodiments of the invention, the Nash equilibrium may be satisfied when the equilibrium between the sets of metrics in the optimization policies is within an allowed deviation (ϵ).

According to a second aspect of the invention, there is provided a system for distributing resources in a predetermined geographical area, the system including a control unit configured to: retrieve a set of metrics, the metrics indicative of factors of interest related to operation of the resources for at least two parties, each party having a plurality of resources; retrieve optimization policies indicative of preferred metric values for each of the at least two parties; retrieve at least one model including strategies for distributing resources in the predetermined area, the at least one model based on learning from a set of scenarios for distributing resources; retrieve context data from real-time systems, the context data indicative of at least a present traffic situation; establish a Nash equilibrium between the metrics in the optimization policies of the at least two parties taking into account the at least one model and the context data; and distribute the resources in the geographical area according to the outcome of the established Nash equilibrium.

The system may further include a simulator module configured to generate the model strategies based on reinforcement learning algorithms.

In some embodiments, the simulator module may be configured to: generate a demand model based on a demand meta-model, a context, and a place, the demand model adapted to predict transportation need; generate an acceptance model based on the demand model, party mobility preferences, and transportation options, the acceptance model adapted to predict transportation preferences; and generate an intent model based on the arrival coordinate and venue, wherein the control unit is configured to establish the Nash equilibrium based further on the demand model, the acceptance model, and the intent model.

The control unit and the simulator module may be arranged on a server.

This second aspect of the invention provides advantages similar to those discussed above in relation to the previous aspect of the invention.

According to a third aspect of the invention, there is provided a computer program product including a computer readable medium having stored thereon computer program means for distributing resources in a predetermined geographical area, wherein the computer program product includes: code for retrieving a set of metrics from a simulation, the metrics indicative of factors of interest related to operation of the resources for at least two parties, each party having a plurality of resources; code for retrieving optimization policies indicative of preferred metric values for each of the at least two parties; code for retrieving at least one model including strategies for distributing resources in the predetermined area, the at least one model based on learning from a set of scenarios for distributing resources; code for retrieving context data from real-time systems, the context data indicative of at least a present traffic situation; code for establishing a Nash equilibrium between the sets of metrics in the optimization policies of the at least two parties taking into account the at least one model and the context data; and code for distributing the resources in the geographical area according to the outcome of the established Nash equilibrium.

This third aspect of the invention provides advantages similar to those discussed above in relation to the above-mentioned aspects of the invention.

Further features of, and advantages with, the present invention will become apparent when studying the appended claims and the following description. The skilled person will realize that different features of the present invention may be combined to create embodiments other than those described in the following, without departing from the scope of the present invention.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other aspects of the present invention will now be described in more detail, with reference to the appended drawings showing example embodiments of the invention, wherein:

FIG. 1 conceptually illustrates the application of embodiments of the invention;

FIG. 2 is a conceptual functional flowchart illustrating embodiments of the invention;

FIG. 3 conceptually illustrates the generation of a demand model, an acceptance model and a choice model;

FIG. 4 is a table of example metrics for several parties;

FIG. 5 conceptually illustrates a simulator module operation flow;

FIG. 6 schematically illustrates a reinforcement learning process in accordance with the inventive concept;

FIG. 7 schematically illustrates a reinforcement learning process for constructing Nash equilibrium in accordance with the inventive concept; and

FIG. 8 is a flow chart of method steps according to embodiments of the invention.

DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS

In the present detailed description, various embodiments of the system and method according to the present invention are mainly described with reference to distributing resources in the form of vehicles. However, the present invention may equally be used with other resources, such as charging stations for electric vehicles, parking lots, package delivery systems, metro line planning, bike sharing distribution, public transportation planning, etc. Thus, this invention may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for thoroughness and completeness, and fully convey the scope of the invention to the skilled person. Like reference characters refer to like elements throughout.

FIG. 1 conceptually illustrates an application of the invention. In FIG. 1, two parties each have a fleet of autonomous vehicles (or human-assisted self-driving vehicles). Party 102 has three autonomous vehicles, 103 a, 103 b, and 103 c, in its fleet, and party 105 has three autonomous vehicles, 106 a, 106 b, and 106 c, in its fleet. Here, for clarity, only three vehicles are shown in each fleet, and only for two parties. However, the invention is applicable to several parties, each having any number of resources in its fleet, e.g. hundreds of vehicles. Furthermore, a party may also be the city itself (e.g. the infrastructure), a charging station provider, etc.

The autonomous vehicles 103 a-c, 106 a-c of the two fleets compete in a predetermined area 100 including various agents, such as places of interest (stadiums, museums, parks, etc.), cars, roads, road works, parking spaces, charging stations, bridges, tunnels, etc. (not numbered). The objective of the fleets is to provide mobility services to customers in the predetermined area 100. In some embodiments, the fleets have access to the choice and acceptance models of their customers, which are probabilistic distributions describing transportation preferences and preferred activities of their customers at the present time and day. The acceptance models and choice models will be described further with reference to FIG. 3.

In order to understand the real world and thereby operate in the predetermined area 100, the control units of the vehicles 103 a-c, 106 a-c run simulations, preferably in the cloud 104, which generate a vast number of scenarios (e.g. thousands or more) running thousands (or more) of times using reinforcement learning. From the simulation runs, they may arrive at values for strategies for models which assist the control unit in navigating and operating the respective vehicle in the predetermined area 100. Each of the parties 102 and 105 has its respective optimization policy, which provides a strategy for a Nash equilibrium simulation. Then, in simulation, a Nash equilibrium is derived where the model strategies converge to a sustainable state. The various agents (e.g. places of interest, cars, roads, road works, parking spaces, charging stations, bridges, tunnels, etc.) in the simulated grid worlds may be modelled with lifetimes and behaviors. The parties 102, 105 can then deploy their model strategies in the predetermined area (i.e. the “real world”) and also learn from the real world to refine their models.

FIG. 2 illustrates a functional flow chart for embodiments of the invention. FIG. 2 illustrates a system 200 for distributing resources. A control unit 201 for orchestration of the resources is configured to receive context data 202 including information about the present traffic situation, e.g. traffic feeds, event details, weather, status of electrical charging stations, etc. The control unit 201 further receives models 210 of the agents (e.g. stadiums, museums, parks, cars, roads, road works, parking spaces, charging stations, bridges, tunnels, etc.), i.e. the behavior of the agents in different situations, and models 203 for state transitions, i.e. parking state transitions, charging state transitions, exploring transitions, and moving transitions for the autonomous vehicle.

The exploring transitions occur, for instance, when an autonomous car is not busy and the system deploys the vehicle to explore a coverage area. The moving transition may be an operational decision based on the discrepancy between the predicted demand and the location of the vehicles. For example, vehicles 103 a-c may be at the outskirts of the predetermined area 100 while demand is projected at a soon-to-be-over concert or ball game. In this case, in order to minimize the wait time, the vehicles could be controlled to travel to a location near the location of the predicted increase in demand.
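
A minimal sketch of such a moving decision, assuming illustrative zone names and counts, might rank zones by the gap between predicted demand and vehicles present and relocate from the largest surplus to the largest shortfall:

    def dispatch_move(predicted_demand: dict, vehicle_count: dict):
        # Relocate idle vehicles from the biggest-surplus zone to the
        # biggest-shortfall zone (a simplification of the decision).
        zones = set(predicted_demand) | set(vehicle_count)
        gap = {z: predicted_demand.get(z, 0) - vehicle_count.get(z, 0)
               for z in zones}
        needy = max(zones, key=lambda z: gap[z])
        surplus = min(zones, key=lambda z: gap[z])
        if gap[needy] <= 0 or gap[surplus] >= 0:
            return None  # supply already matches demand
        n = min(gap[needy], -gap[surplus])
        return {"from": surplus, "to": needy, "vehicles": n}

    # e.g. demand projected near a stadium as a game ends:
    plan = dispatch_move({"stadium": 40, "outskirts": 5},
                         {"stadium": 10, "outskirts": 25})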

Further, a simulator module 204 is configured to generate models including strategies for distributing resources based on reinforcement learning, and to provide the models 206 to the control unit 201. Moreover, the control unit 201 retrieves optimization policies 208 indicative of preferred metric values for each of at least two parties operating in the predetermined area 100. The control unit 201 outputs commands 212 and controls to affect the resources in the predetermined area 100.

The system 200 includes a set of machine learning modules and neural networks, as well as rules, supervisory controls, and other modules.

The system control unit 201 also feeds back S202 into the simulator module 204, for it to learn from real-time actual events. This feedback loop is advantageous for learning because, as much as the simulator module 204 can simulate a large number of conditions, it still cannot fully comprehend how the agents behave in the real world. The real-world feedback data and model strategies would be given appropriately higher weight to influence and complement the simulation learnings of the simulator module 204.

FIG. 3 conceptually illustrates the generation of a demand model 307, an acceptance model 308, and an intent model 309. The models 307-309 are provided in the form of probabilistic distributions, such as Poisson distributions.

The simulator module (204 in FIG. 2) receives meta-models (301, 302, 303, 304) for data for various parts of the input system. The meta-models (301, 302, 303, 304) may be created from actual data of different modalities (e.g. different types of transportation, weather data, time, different user groups based on e.g. age, demographics, etc.), then customized for a specific instance of simulation by adaptation factors 305 (e.g. if a different predetermined area is considered), and then dynamically adjusted by real-time context 306. In short, the model outcomes (307, 308, 309) are constructed from meta-models 301-304 trained on actual data from different and varied sources, then adapted by specific factors (305), and then scaled by context 306. The actual data used for training the models may be taxi cab data, game attendance, concert details, traffic conditions, and so forth for 1-2 years (spanning holidays, working days, cold days, summer days, etc.) from major cities.

The demand prediction meta-model 301 may be a deep learning neural network model that is trained for a particular predetermined area, usually a geographic location having associated context data 306. The context data 306 may include time of day, day of week, day of month, holidays, places of interest, special events, etc. The model 301 would have been trained in a deep learning neural network with large amounts of different data, such as cab hailings, transport data from public transport, and government data (for example vehicle data that is subject to mandatory reporting in various countries).

This demand meta-model 301 may, for instance, be able to predict demand on a Monday morning (students and workers), on Thanksgiving day (holiday crowd), to and from a concert or a ball game, weekend demand, tourist demand on a summer day, the home-returning crowd on a rainy or snowy evening, and so forth.

In addition, the demand prediction meta-model 301 is aggregated across all transportation modes (walking, cabs, shared rides, public transportation, park-n-go, and so forth). Accordingly, the demand meta-model 301 may provide a distribution of the demand at a given time interval (say, 9:00 AM-10:00 AM on a Monday morning in January in New York) based on the parameters. Normally this distribution is quite limited in scope and may be used in a relatively restricted fashion.

A first step in generating the demand model 307 for a specific context and predetermined area is to apply an adaptation model (305) that adapts this specific model 301 to a different condition. For example, a Tuesday morning student commute on a winter day in New York can be scaled (not linearly, but based on a neural model) to January Tuesday morning traffic in Gothenburg, where there is a similarity in the availability of public transport and in the climate. Adaptation of the model 301 to Los Angeles, where the climate and transportation options are different, may need a different model that is directionally correct but needs additional parameters.

The next step is to apply context 306 (i.e. by inputting another parameter into the trained model) to the demand meta-model 301. For instance, it may occur that this Tuesday in January is a local holiday, and there is a traffic jam, or a bridge closing, or a playoff ball game where the local team is a favorite, in which case there would likely be a full crowd until the end of the game. The contextual scaling provided by the context data 306 provides an advantageous ability to simulate a variety of possibilities and learn from them.

After the inclusion of the adaptation factors 305 and the context layer 306, the demand meta-model 301 results in a demand model 307 that provides a probabilistic distribution (e.g. a Poisson distribution), i.e. how many people would need transportation from this general location at this hour and what the arrival rate at the origin is. The arrival at a location may be modelled as a Poisson distribution, and the arrival rate is a parameter of the distribution.
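
A minimal sketch of such a Poisson demand model, assuming an illustrative arrival rate (λ) of 12 transportation requests per hour at the location:

    import math

    def poisson_pmf(k: int, rate: float) -> float:
        # Probability of exactly k arrivals in one interval at rate λ.
        return math.exp(-rate) * rate ** k / math.factorial(k)

    rate = 12.0  # illustrative: requests per hour from the demand model
    p_at_most_5 = sum(poisson_pmf(k, rate) for k in range(6))
    expected_arrivals = rate  # the mean of a Poisson equals its rate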

The adaptation factors 305 and context 306 serve as the multi-dimensional scaling factors for the predictions of demand. For example, the demand model 307 may predict that x % of a crowd will use public transport and y % will use autonomous cars. A demand model 307 based on data from previous events (ball games, concerts, etc.) at different cities would need to be customized for this event (which might have a smaller crowd size) and, say, weather (a rainy or cold day as opposed to a sunny afternoon). These may not be only linear models, but may need a multi-dimensional, multi-modal complex model (like a deep neural network) which takes in dynamic contextual input; in other words, the models may have a plurality of dimensions with parameters, and training may be performed using data from different domains as well as a rich set of customization via context and adaptation factors. Thus the simulator module will feed in different contextual scenarios and run a large number of simulations.

The preference meta-model 302 provides a persona-based preference that is an overlay over the available transportation options 303. A mobility ride-sharing entity can use the simulator module to add more ride-sharing options and see how it can increase its share. Another application may be that a metro operator can evaluate whether it is possible to increase use of public transport by adding more trains or buses to a specific route.

Accordingly, the preference model 302 is contextual and persona-based. For instance, a student population can be incentivized to take public transport by increasing the availability of public transport; but for an Academy Awards crowd, a mobility operator with higher-end vehicles would get more business; and for a music festival, more ride sharing or public transport may be an advantageous choice for the mobility providers (e.g. the parties). Moreover, preferences of the users of mobility services may also be used for temporarily increasing the range and capacity of autonomous electric vehicles for an event or for a time.

The preference meta-model 302 overlaid on the demand distribution results in the acceptance distribution 308 for the multiple transportation options 303; while the demand model distribution 307 is a single curve, the acceptance distribution 308 is a set of distribution curves.
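
One way to sketch this overlay, assuming the illustrative mode shares below: splitting (thinning) a Poisson arrival stream by persona-based mode shares yields one Poisson rate, i.e. one curve, per transportation option:

    def acceptance_rates(total_rate: float, mode_shares: dict) -> dict:
        # Thinning a Poisson process with probability p gives a Poisson
        # process with rate total_rate * p, so each mode has its own curve.
        assert abs(sum(mode_shares.values()) - 1.0) < 1e-9
        return {mode: total_rate * share
                for mode, share in mode_shares.items()}

    curves = acceptance_rates(12.0, {"public_transit": 0.5,
                                     "ride_share": 0.3,
                                     "autonomous_cab": 0.2})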

A further model is the intent meta-model 304. For instance, it may be possible to know how many people would need transportation every unit of time (say, hourly) and also how they would travel. The intent model 304 adds what the users plan to do once they reach their destination. For example, the users may be going home at the end of a working day, looking for places to eat (with specificity), or going to a concert, etc. The intent model 304 usually combines multimodal data from check-ins, travel logs, and resolving places data with drop-offs (i.e. using intelligent APIs to figure out the most probable place a person would visit after a drop-off at a GPS coordinate).

The intent meta-models 304 are adapted and contextualized, resulting in another set of distributions 309 (an intent model 309) based on intent. The distributions 309 (e.g. Poisson) would be concentrated or spread out depending on the destination; e.g. if the destination is a ball game or a concert, there will be a lot of people going there, but the demand out of a ball game on a summer evening would be distributed to multiple places, say burger joints, an Italian place, and many going back to residential neighborhoods, etc.

The simulator module may thus generate a host of demand, transportation options, preferences, and intent based on models trained from multimodal data from a plurality of sources and locations, and it can adapt the models to a specific location and a specific context.

FIG. 4 is a table of example metrics for several types of parties. In this table, the party categories are a customer, a mobility service operator, a package delivery operator, a city, and a charging station operator. There may be more than one party in each category. An optimization policy may be formed from e.g. simulations and may, for example, be, for a mobility operator: Max($, β, ζ, R, M), Min(ρ); and for a package delivery operator: Max(τ1, $, β), Min(ρ).

FIG. 5 conceptually illustrates a simulator module operation flow. The simulator module 204 takes the demand 307, acceptance 308, and intent 309 models as input. The models 307, 308, 309 are developed as described above with reference to FIG. 3. The models 307, 308, 309 may be provided as distributions (e.g. Poisson distributions) for arrival rates and arrival volumes in different contexts. Thus, the arrival rates and arrival volumes may vary from time of day to day of week to special days, as well as with different contextual conditions, such as bridge closings or road works. Further, the simulator module 204 receives optimization policies 208 for the parties and the geographical area.

The simulator module 204 may further take agent models 210 and models 203 for state transitions as inputs.

The simulator module outputs a set of data, logs 506 and metrics 508, as well as model strategies 206. The transactions (e.g. activities performed by the simulator module) during the simulation run are logged such that the simulations may be recreated at a later time. Further, each simulation outputs a set of metrics 508. The metrics are described with reference to FIG. 4. The data and logs may be used by the control unit 201 when attempting to reach Nash equilibrium. Further, the model strategies 206 include value iterations and policy declarations for reinforcement learning, as well as strategies for Nash equilibrium calculations under different conditions provided by the context data. The model strategies will be used by the control unit in orchestrating the resources.

FIG. 6 schematically illustrates an overview of a reinforcement learning process in accordance with various embodiments of the inventive concept.

The reinforcement learning module 600 receives the demand 307, acceptance 308, and intent 309 models 606, including strategies from the parties (102, 105), further regulations and constraints 607, such as traffic rules, and data indicative of the states and transitions 610 of the mobility resources. The states and transitions 610 of the mobility resources may depend on and be adapted by the regulations, policies, and constraints 607, and optionally by environmental parameters 608, such as the number of vehicles, customers, etc. in the area. Based on the inputs, the reinforcement learning module 600 provides a set of logs 506 and metrics 508.

The reward function of the reinforcement learning module 600 is the outcome of the Nash equilibrium calculation, where the goal of the reinforcement learning module 600 is to find sets of metrics which satisfy the Nash equilibrium condition. The Nash equilibrium calculating module 602 calculates the Nash equilibrium based on the sets of metrics 508 and logs 506, and if the Nash equilibrium is near equilibrium within the deviation (ϵ) (ϵ may be provided as a numeric value), control signals are sent to the control unit 201 for controlling the distribution of mobility resources in the predetermined geographical area. The control unit 201 may also feed back data into the reinforcement learning module 600 for it to learn from real-time actual events. This feedback loop is advantageous for learning because, as much as the simulator module 204 can simulate a large number of conditions, it still cannot fully comprehend how the agents behave in the real world. The reinforcement learning module 600 preferably applies deep reinforcement learning algorithms.
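
A minimal sketch of this equilibrium-seeking loop follows; the callables are stand-ins for the modules described here (simulator, Nash equilibrium calculation, and policy update), not the claimed implementation:

    def seek_equilibrium(simulate, nash_gap, update, params, epsilon,
                         max_iter=1000):
        # The simulator yields metrics; nash_gap measures how far the
        # metric sets are from equilibrium (0.0 = exact equilibrium);
        # the negated gap serves as the reinforcement learning reward.
        for _ in range(max_iter):
            metrics = simulate(params)
            gap = nash_gap(metrics)
            if gap <= epsilon:
                return params, metrics  # near-equilibrium: deploy
            params = update(params, reward=-gap)
        return params, None  # no near-equilibrium found within budget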

Now turning to FIG. 7, which provides a functional flow chart for the Nash equilibrium calculation and evaluation, the Nash equilibrium calculating module 602 (which may be a software module) serves as a controller for seeking Nash equilibrium. The Nash equilibrium calculating module 602 receives (S104, S102) logs and metrics 506, 508 from the reinforcement learning module 600. Further, the Nash equilibrium calculating module 602 receives (S106) models including strategies, including e.g. parameters, functions, and rewards needed for establishing the Nash equilibrium. The Nash equilibrium calculating module 602 may further receive policies and other constraints and regulations 604 related to the predetermined area 100.

The Nash equilibrium is constructed (S110) for a number of possible scenarios provided from the reinforcement learning module 600, and is based on the metrics and optimization policies for the parties and the individual reward functions for the resources (i.e. the autonomous vehicles 103 a-c, 106 a-c). If the Nash equilibrium is not near equilibrium within a deviation (ϵ) (S603), parameters 612 from the Nash equilibrium (e.g. resulting metrics) are sent to the reinforcement learning module 600, which may be part of the simulation module. Multiple Nash equilibria may be possible between parties. The allowed deviations (ϵ) are provided to the reinforcement learning module 600 from the parties 606 as part of their model strategies.

If the Nash equilibrium condition is satisfied in S603, then control signals are sent to the control unit 201 for controlling the distribution of mobility resources in the predetermined geographical area. The inferences, learned strategies, and policies, along with the model representations, are stored in the database 624. The system can use the models during real-time orchestration.

Optionally, the deviation (ϵ) may be adjusted S605 and fed back to the Nash equilibrium calculating module 602. This dynamic adjustment provides agility, flexibility, and the ability to reflect changing real-world scenarios. Thus, changes in the predetermined area may cause the original ϵ to be too high or too low, so the allowed deviation (ϵ) may be dynamically adjusted based on real-time feedback.

In addition, the system control unit 201 also feeds back S202 into the reinforcement learning module 600, for it to learn from real-time actual events. In other words, if it is determined S610 that further scenarios not yet covered in the reinforcement learning process have been found, such a scenario will be provided to the next iteration of reinforcement learning in the module 600. Additional feedback may be provided S203 from the control unit 201 in the form of metrics, parameters, or model artefacts that may have changed due to learning from the real world.

FIG. 8 is a flow chart of method steps according to embodiments of the invention. A set of metrics is retrieved in step S102, the metrics being indicative of factors of interest related to operation of the resources for at least two parties, each party having a plurality of resources. In step S104, optimization policies are retrieved, indicative of preferred metric values for each of the at least two parties. Further, in step S106, a model strategy is retrieved for each of the at least two parties, the model strategies being indicative of acceptable deviations from the optimization policies. Context data from real-time systems is retrieved in step S108, the context data being indicative of at least a present traffic situation. In step S110, a Nash equilibrium is modelled between the metrics in the optimization policies of the at least two parties, taking into account the model strategies and the context data. The resources in the geographical area are distributed according to the outcome of the modelled Nash equilibrium in step S112.

The reinforcement learning mechanisms are applied to infer choice and demand patterns from simulation runs based on the distributions. Reinforcement learning is also used to learn from feedback from real-world situations. Another flow that implements the reinforcement learning is finding anomalies, whether intentional or deviations, during real-world orchestration flows. The reinforcement learning layer consists of codifying the states from the finite-state automata, capturing the state transitions, deriving the Nash equilibrium at the states of interest, and then iterating value and policy based on the equilibrium values. The parameters that are part of the context will be changed for each set of episodes, and that change and the associated values/policies are mapped in the reinforcement layer. The reward and the value are, in fact, functions of the context; it is these equilibrium-seeking reinforcement agents that give the system the ability to manage when a bridge closes, traffic surges, or an event ends. These are all contexts, and the rewards differ in each case. Moreover, the rewards are driven by the optimization policies; for example, the reward under a profit maximization policy is different from the reward under a market share maximization policy. In fact, the reward function in this patent is a weighted function of multiple optimization policies, so one can construct a policy that weighs heavily on the maximization of revenue but also places some importance on market share.
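
To make the codifying of states from the finite-state automata concrete, a minimal sketch of the vehicle states and transitions described with reference to FIG. 2 follows; the SERVING state and the exact transition set are illustrative assumptions, not the claimed automaton:

    from enum import Enum, auto

    class VehicleState(Enum):
        PARKED = auto()
        CHARGING = auto()
        EXPLORING = auto()
        MOVING = auto()
        SERVING = auto()  # assumed extra state, not named in the text

    # Allowed transitions of the finite-state automaton; the RL layer
    # logs each transition and iterates value/policy at states of interest.
    TRANSITIONS = {
        VehicleState.PARKED: {VehicleState.CHARGING, VehicleState.MOVING,
                              VehicleState.EXPLORING},
        VehicleState.CHARGING: {VehicleState.PARKED, VehicleState.MOVING},
        VehicleState.EXPLORING: {VehicleState.MOVING, VehicleState.PARKED},
        VehicleState.MOVING: {VehicleState.SERVING, VehicleState.PARKED,
                              VehicleState.EXPLORING},
        VehicleState.SERVING: {VehicleState.MOVING, VehicleState.PARKED},
    }

    def step(state: VehicleState, nxt: VehicleState) -> VehicleState:
        if nxt not in TRANSITIONS[state]:
            raise ValueError(f"illegal transition {state} -> {nxt}")
        return nxt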

The control functionality of the present disclosure may be implemented using existing computer processors, or by a special purpose computer processor for an appropriate system, incorporated for this or another purpose, or by a hardwired system. Embodiments within the scope of the present disclosure include program products including machine-readable media for carrying or having machine-executable instructions or data structures stored thereon. Such machine-readable media can be any available media that can be accessed by a general purpose or special purpose computer or other machine with a processor. By way of example, such machine-readable media can include RAM, ROM, EPROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store desired program code in the form of machine-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer or other machine with a processor. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired and wireless) to a machine, the machine properly views the connection as a machine-readable medium. Thus, any such connection is properly termed a machine-readable medium. Combinations of the above are also included within the scope of machine-readable media. Machine-executable instructions include, for example, instructions and data which cause a general purpose computer, special purpose computer, or special purpose processing machine to perform a certain function or group of functions.

Although the figures may show a sequence, the order of the steps may differ from what is depicted. Also, two or more steps may be performed concurrently or with partial concurrence. Such variation will depend on the software and hardware systems chosen and on designer choice. All such variations are within the scope of the disclosure. Likewise, software implementations could be accomplished with standard programming techniques, with rule-based logic and other logic to accomplish the various connection steps, processing steps, comparison steps, and decision steps.

The person skilled in the art realizes that the present invention by no means is limited to the preferred embodiments described above. On the contrary, many modifications and variations are possible within the scope of the appended claims.

In the claims, the word “comprising” does not exclude other elements or steps, and the indefinite article “a” or “an” does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage. Any reference signs in the claims should not be construed as limiting the scope.

What is claimed is:
1. A method for distributing resources in a predetermined geographical area, the method comprising: retrieving, via a communications connection, at a memory storing instructions executed by a processor to operate as a control unit, a set of metrics from at least two parties comprising at least two of a mobility service provider, a package delivery operator, a charging station operator, an infrastructure provider, and a user of any thereof, the set of metrics including separate metrics for each of the at least two parties, each of the separate metrics indicative of factors of interest related to operation of the resources for a respective party of the at least two parties, each party having a plurality of resources, retrieving, via the communications connection, at the control unit, optimization policies from the at least two parties, the optimization policies including an optimization policy for each of the at least two parties, each of the optimization policies being indicative of preferred metric values for the respective party, retrieving, at the control unit, models comprising strategies for distributing resources of the plurality of resources from each of the at least two parties in the predetermined area, the models based on learning from a set of scenarios for distributing resources, wherein the models comprise a demand model, an acceptance model, and an intent model, wherein retrieving the models comprises: constructing the demand model based on a demand meta-model and context and place information, the demand model adapted to predict transportation need, wherein the demand meta-model is formed by retrieving transportation data from one or more of a transportation data provider and an infrastructure data provider, aggregating the demand meta-model across multiple available transportation modes, and training the demand meta-model for the predetermined area, and wherein the context and place information is retrieved from one or more of a traffic data provider, an infrastructure data provider, and an event data provider, constructing the acceptance model based on the demand model, party mobility preferences, and transportation options, the acceptance model adapted to predict transportation preferences, and generating the intent model based on an arrival coordinate and venue, retrieving, via the communications connection, at the control unit, context data from real time systems, the context data indicative of at least a present traffic situation, establishing, at the control unit, a Nash equilibrium between the metrics in the optimization policies of the at least two parties taking into account the models and the context data, via the control unit, based on an outcome of the established Nash equilibrium, distributing the resources in the geographical area, the distribution of the resources optimally satisfying the metrics in the optimization policies of the at least two of the mobility service provider, the package delivery operator, the charging station operator, the infrastructure provider, and the user of any thereof taking into account the models and the context data; and via the control unit, directing certain autonomous vehicles of the resources to explore the predetermined area with proximity sensors and cameras, receiving models for observed state transitions associated with the certain autonomous vehicles, and incorporating the models for the observed state transitions into a revised Nash equilibrium.
2. The method according to claim 1, wherein the models are based on training with reinforcement learning algorithms.
3. The method according to claim 1, comprising: receiving a request for the resources, the request comprising a set of preferred metrics, and establishing the Nash equilibrium based further on the request.
4. The method according to claim 3, comprising: providing an offer based on the request and the outcome of the established Nash equilibrium, receiving a response to the offer, and distributing the resources further based on the response.
5. The method according to claim 1, further comprising calculating adaptation factors for a further geographical area not comprised in the predetermined area based on at least area size and population density at places of interest, scaling properties of the model for the predetermined area to the further area to form an adapted model, and using the adapted model for distributing the resources in the further area.
6. The method according to claim 1, further comprising: training the model based on the outcome of the distribution of the resources and reinforcement learning.
7. The method according to claim 1, wherein the resources are mobility units.
8. The method according to claim 1, wherein the Nash equilibrium is satisfied when the equilibrium between the sets of metrics is within an allowed deviation (ϵ).
9. The method according to claim 1, further comprising modifying one or more of the models, re-establishing the Nash equilibrium, and, if a new state of the Nash equilibrium differs from an old state of the Nash equilibrium by more than a predetermined amount, only then redistributing the resources in the geographical area based on the re-established Nash equilibrium.
10. A system for distributing resources in a predetermined geographical area, the system comprising: memory storing instructions executed by a processor to operate as a control unit configured to: retrieve, via a communications connection, a set of metrics from at least two parties, the set of metrics including separate metrics for each of the at least two parties, each of the separate metrics indicative of factors of interest related to operation of the resources for a respective party of the at least two parties, each party having a plurality of resources, retrieve, via the communications connection, optimization policies from the at least two parties, the optimization policies including an optimization policy for each of the at least two parties, each of the optimization policies being indicative of preferred metric values for the respective party, retrieve models comprising strategies for distributing resources of the plurality of resources from each of the at least two parties in the predetermined area, the models based on learning from a set of scenarios for distributing resources, wherein the models comprise a demand model, an acceptance model, and an intent model, wherein retrieving the models comprises: constructing the demand model based on a demand meta-model and context and place information, the demand model adapted to predict transportation need, wherein the demand meta-model is formed by retrieving transportation data from one or more of a transportation data provider and an infrastructure data provider, aggregating the demand meta-model across multiple available transportation modes, and training the demand meta-model for the predetermined area, and wherein the context and place information is retrieved from one or more of a traffic data provider, an infrastructure data provider, and an event data provider, constructing the acceptance model based on the demand model, party mobility preferences, and transportation options, the acceptance model adapted to predict transportation preferences, and generating the intent model based on an arrival coordinate and venue, retrieve, via the communications connection, context data from real time systems, the context data indicative of at least a present traffic situation, establish a Nash equilibrium between the metrics in the optimization policies of the at least two parties taking into account the models and the context data, based on an outcome of the established Nash equilibrium, distribute the resources in the geographical area, the distribution of the resources optimally satisfying the metrics in the optimization policies of the at least two of the mobility service provider, the package delivery operator, the charging station operator, the infrastructure provider, and the user of any thereof taking into account the models and the context data; and direct certain autonomous vehicles of the resources to explore the predetermined area with proximity sensors and cameras, receive models for observed state transitions associated with the certain autonomous vehicles, and incorporate the models for the observed state transitions into a revised Nash equilibrium.
11. The system according to claim 10, the memory further storing instructions executed by the processor to operate as a simulator module configured to: generate the models based on reinforcement learning algorithms.
12. The system according to claim 11, further comprising a server, wherein the memory storing the instructions executed by the processor to operate as the control unit and the simulator module is arranged on the server.
13. The system according to claim 10, wherein the control unit is further configured to modify one or more of the models, re-establish the Nash equilibrium, and, if a new state of the Nash equilibrium differs from an old state of the Nash equilibrium by more than a predetermined amount, only then redistribute the resources in the geographical area based on the re-established Nash equilibrium.
14. A computer program product comprising a non-transitory computer readable medium having stored thereon computer program means comprising instructions stored in a memory and executed by a processor to operate as a control unit for distributing resources in a predetermined geographical area, wherein the instructions comprise steps for: retrieving, via a communications connection, a set of metrics from at least two parties comprising at least two of a mobility service provider, a package delivery operator, a charging station operator, an infrastructure provider, and a user of any thereof, the set of metrics including separate metrics for each of the at least two parties, each of the separate metrics indicative of factors of interest related to operation of the resources for a respective party of the at least two parties, each party having a plurality of resources, retrieving, via the communications connection, optimization policies from the at least two parties, the optimization policies including an optimization policy for each of the at least two parties, each of the optimization policies being indicative of preferred metric values for the respective party, retrieving models comprising strategies for distributing resources of the plurality of resources from each of the at least two parties in the predetermined area, the models based on learning from a set of scenarios for distributing resources, wherein the models comprise a demand model, an acceptance model, and an intent model, wherein retrieving the models comprises: constructing the demand model based on a demand meta-model and context and place information, the demand model adapted to predict transportation need, wherein the demand meta-model is formed by retrieving transportation data from one or more of a transportation data provider and an infrastructure data provider, aggregating the demand meta-model across multiple available transportation modes, and training the demand meta-model for the predetermined area, and wherein the context and place information is retrieved from one or more of a traffic data provider, an infrastructure data provider, and an event data provider, constructing the acceptance model based on the demand model, party mobility preferences, and transportation options, the acceptance model adapted to predict transportation preferences, and generating the intent model based on an arrival coordinate and venue, retrieving, via the communications connection, context data from real time systems, the context data indicative of at least a present traffic situation, establishing a Nash equilibrium between the metrics in the optimization policies of the at least two parties taking into account the models and the context data, based on an outcome of the established Nash equilibrium, distributing the resources in the geographical area, the distribution of the resources optimally satisfying the metrics in the optimization policies of the at least two of the mobility service provider, the package delivery operator, the charging station operator, the infrastructure provider, and the user of any thereof taking into account the models and the context data; and directing certain autonomous vehicles of the resources to explore the predetermined area with proximity sensors and cameras, receiving models for observed state transitions associated with the certain autonomous vehicles, and incorporating the models for the observed state transitions into a revised Nash equilibrium.
15. The computer program product according to claim 14, wherein the models are based on training with reinforcement learning algorithms.
16. The computer program product according to claim 14, the instructions further comprising steps for: receiving a request for the resources, the request comprising a set of preferred metrics, and establishing the Nash equilibrium based further on the request.
17. The computer program product according to claim 16, the instructions further comprising steps for: providing an offer based on the request and the outcome of the established Nash equilibrium, receiving a response to the offer, and distributing the resources further based on the response.
18. The computer program product according to claim 14, wherein the Nash equilibrium is satisfied when the equilibrium between the sets of metrics is within an allowed deviation (ϵ).
19. The computer program product according to claim 14, wherein the instructions further comprise steps for modifying one or more of the models, re-establishing the Nash equilibrium, and, if a new state of the Nash equilibrium differs from an old state of the Nash equilibrium by more than a predetermined amount, only then redistributing the resources in the geographical area based on the re-established Nash equilibrium.